SQUAD OF FIFA 2018 by SAHIL CHUTANI

Introduction

Where do most of the players in FIFA 2018 come from: South America or Europe? What is the most common age of the players, and what is their age range? How is their performance distributed? These are the questions I would like to answer through exploratory data analysis. I will use the ggplot2 library that I learnt in the lesson, coupled with plotly for interactive visualization.

Dataset

The dataset features every player in FIFA 2018 with 70+ attributes. It contains personal attributes such as Nationality, Photo, Club, Age, Wage, and Value. I downloaded the dataset from https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset.

The dataset is tidy except for a few columns: Wage, Value, and Preferred.Positions. I will extract the numeric values from the Wage and Value columns, and pull out the most preferred position from the Preferred.Positions column, assuming the positions are listed in order of preference.
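The cleaning steps described above could be sketched in base R as follows. This is only an illustration, not the code used for this report: the helper names `to_thousands` and `first_position` are hypothetical, and the sketch assumes that money values are stored as strings such as "€95.5M" or "€565K" and that positions are separated by spaces.

```r
# Hypothetical helper: convert money strings to thousands of currency units.
# Assumption: values look like "€95.5M" or "€565K".
to_thousands <- function(x) {
  # Strip everything except digits and the decimal point, then parse.
  amount <- as.numeric(gsub("[^0-9.]", "", x))
  # "M" suffix = millions, "K" suffix = thousands, no suffix = single units.
  multiplier <- ifelse(grepl("M$", x), 1000,
                       ifelse(grepl("K$", x), 1, 0.001))
  amount * multiplier
}

# Hypothetical helper: keep the first entry of a space-separated list,
# on the assumption that positions are listed in order of preference.
first_position <- function(x) {
  sub("\\s.*$", "", trimws(x))
}

# "\u20ac" is the euro sign, written as an escape to avoid encoding issues.
to_thousands(c("\u20ac95.5M", "\u20ac565K"))  # 95500 and 565 (thousands)
first_position(c("ST LW ", "CB"))             # "ST" and "CB"
```

Expressing both columns in thousands keeps them on the same scale as the summaries shown later in the report.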

Summary of FIFA 2018

##             Name            Age       
##  J. Rodríguez:    7   Min.   :16.00  
##  J. Valencia  :    7   1st Qu.:21.00  
##  J. Williams  :    7   Median :25.00  
##  D. González :    6   Mean   :25.14  
##  Danilo       :    6   3rd Qu.:28.00  
##  Felipe       :    6   Max.   :47.00  
##  (Other)      :17942                  
##                                              Photo      
##  https://cdn.sofifa.org/48/18/players/197083.png:    2  
##  https://cdn.sofifa.org/48/18/players/198113.png:    2  
##  https://cdn.sofifa.org/48/18/players/198140.png:    2  
##  https://cdn.sofifa.org/48/18/players/198329.png:    2  
##  https://cdn.sofifa.org/48/18/players/198584.png:    2  
##  https://cdn.sofifa.org/48/18/players/198614.png:    2  
##  (Other)                                        :17969  
##  Nationality           Overall        Potential    
##  Length:17981       Min.   :46.00   Min.   :46.00  
##  Class :character   1st Qu.:62.00   1st Qu.:67.00  
##  Mode  :character   Median :66.00   Median :71.00  
##                     Mean   :66.25   Mean   :71.19  
##                     3rd Qu.:71.00   3rd Qu.:75.00  
##                     Max.   :94.00   Max.   :94.00  
##                                                    
##                 Club          Value               Wage          
##                   :  248   Length:17981       Length:17981      
##  Villarreal CF    :   35   Class :character   Class :character  
##  Borussia Dortmund:   34   Mode  :character   Mode  :character  
##  FC Nantes        :   34                                        
##  Manchester United:   34                                        
##  OGC Nice         :   34                                        
##  (Other)          :17562                                        
##  Preferred.Positions  Continent        
##  Length:17981        Length:17981      
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
##                                        
## 

Observations

  1. Age ranges from 16 to 47 years, with a mean of 25.14 and a median of 25. Because the mean and median are so close, I suspect Age is roughly normally distributed. I will plot a histogram in the univariate plots section to see if this is the case.

  2. Looking at the Nationality column, the top 5 countries are all from either Europe or South America. In the univariate plots section I will group the data by Nationality and plot it on a map to visualize the distribution of players by country.

  3. The Overall and Potential columns both range from 46 to 94, with means of 66.25 and 71.19 respectively. The roughly 5-point difference in means makes me wonder how many players have room for improvement. I would like to explore the difference between the two columns in the plots section below. I expect these two columns to be heavily correlated.

Univariate Plots Section


Indeed, the age distribution looks roughly normal. 1522 players are aged 25, and most players are clustered around that age, as I expected.


No surprises here either: the most common Overall score is 66, which matches the median.



The plot shows that the most common Potential score is 70, about 1 point below the mean Potential of 71.19.

Most players are already at their best. I observe that some players have the potential to score more than 10 points above their current Overall. I wonder which countries they belong to. I will explore this further when I visualize the distributions on a world map.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   14.00   21.00   31.95   36.00  565.00   12727


The wage variable has a lot of NAs: 12,727 of 17,981 rows, or about 71%. I will discard this variable from any further analysis.



The plot above looks positively skewed, but I will not draw any conclusions from it because so many values are missing.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      10     300     625    2252    1600  123000    1548


The value variable is very intriguing. The median value is 625K, meaning half the players are valued below 625K and half above it. The 3rd quartile is 1.6M and the maximum value is 123M. In fact I expected this, because most players are not valued in the millions, but I would like to explore the highly valued players further.



It is an interesting plot, but one that should be expected. Given the competitive nature of football, most players are not valued in the hundreds of millions, as can be seen from the plot. The distribution is positively skewed, as expected.
## # A tibble: 6 x 11
##      Nationality mean_Overall max_Overall mean_Potential max_Potential
##            <chr>        <dbl>       <dbl>          <dbl>         <dbl>
## 1 United Kingdom     63.08509          89       69.93548            90
## 2        Germany     65.90088          92       71.57982            92
## 3          Spain     69.93916          90       74.78214            92
## 4         France     67.28630          88       73.02454            94
## 5      Argentina     67.76580          93       72.45907            93
## 6         Brazil     70.89532          92       72.86700            94
## # ... with 6 more variables: mean_Age <dbl>, mean_Diff <dbl>,
## #   max_Diff <dbl>, mean_Value <dbl>, max_Value <dbl>, n <int>


Clearly the reddest regions are in South America and Europe, and the UK has the highest number of players. Most of Asia and Africa is grey, meaning fewer than 60 players come from these regions. In the Middle East there is a stark contrast between nations: Saudi Arabia is much redder than the others. The surprising observations are Canada and New Zealand: both are high-income countries but are grey, so perhaps population affects the number of players from a country.



Center Back is the most preferred position and Right Wing Back is the least preferred position. I wonder if preference has an impact on value.

Univariate Analysis

After exploring the variables in the dataset, I have reached the following conclusions:

  1. Most of the players in FIFA 2018 come from Europe or South America. The UK has the highest number of players.
  2. Most players have already attained their potential, as their Overall is equal to their Potential.
  3. Center Back (CB) is the most preferred position amongst players and Right Wing Back (RWB) is the least preferred position.

The dataset is tidy. Apart from a few changes, such as extracting numbers from text columns, I don’t need to make any more changes.

The most interesting features in the dataset are Nationality, Age, Potential, Overall, Value, and Preferred.Positions. A brief description of the features is as follows:

  1. Nationality -> Nationality of the player.
  2. Age -> Age of the player.
  3. Potential -> The potential score of the player.
  4. Overall -> The current overall score of the player.
  5. Value -> The player's value in thousands of euros.
  6. Preferred.Positions -> The preferred positions of the player.


Other features, such as Continent, might be helpful; I will explore whether they are.


I created a variable PO_Diff, which holds the difference between Potential and Overall. I also created a variable Continent.
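A minimal sketch of how these two derived variables could be built in base R is shown here. The rows are made up for illustration, and the `continent_of` lookup is a hypothetical stand-in for whatever country-to-continent mapping was actually used.

```r
# Toy rows with made-up scores; only the column names come from the report.
players <- data.frame(
  Nationality = c("Brazil", "Spain", "Japan"),
  Overall     = c(94, 66, 60),
  Potential   = c(94, 71, 75),
  stringsAsFactors = FALSE
)

# PO_Diff: how many points a player can still gain.
players$PO_Diff <- players$Potential - players$Overall

# Hypothetical lookup table from country to continent (assumption:
# a named vector; any country missing from it becomes NA).
continent_of <- c(Brazil = "South America", Spain = "Europe", Japan = "Asia")
players$Continent <- unname(continent_of[players$Nationality])

# Share of players already at their potential (PO_Diff of zero).
mean(players$PO_Diff == 0)
```

A PO_Diff of zero marks a player whose Overall already equals his Potential, which is the group described as being "at their best".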


I extracted the numerical values from the Wage and Value variables. I also pulled out the most preferred position from the Preferred.Positions variable.

Bivariate Plots Section


Most regions of the world look similar in their age distribution, with the exception of nations in Africa. There are subtle differences though.

Looking at the overall score across countries, one would think that Mozambique, Oman, and Syria are amongst the countries supplying the best players in the world. This looks suspicious, as it should, because these countries don’t even have a two-digit number of players. Syria has only 1; compare this with Brazil, which has 812 players and a mean overall score of 70.9. I will plot the map once again, using only nations that have at least 200 players in FIFA 2018.
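The filtering step described above can be sketched in base R as follows. The data are made up, and the threshold is lowered to 3 so that the toy example has something to keep; on the full dataset the cutoff would be 200. The names `by_nation`, `min_players` and `reliable` are illustrative only.

```r
# Toy data with made-up scores; the real dataset has 17,981 rows.
players <- data.frame(
  Nationality = c("Brazil", "Brazil", "Brazil", "Syria"),
  Overall     = c(70, 72, 71, 80)
)

# Per-country mean Overall.
by_nation <- aggregate(Overall ~ Nationality, data = players, FUN = mean)
names(by_nation)[2] <- "mean_Overall"

# Per-country player count, aligned with the rows of by_nation.
by_nation$n <- as.vector(table(players$Nationality)[by_nation$Nationality])

# Keep only countries with enough players for the mean to be meaningful.
min_players <- 3  # the report uses 200 on the full dataset
reliable <- by_nation[by_nation$n >= min_players, ]
reliable
```

In the toy data the single Syrian player has the highest score, yet the country is dropped because one player says little about a national average.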

This gives a better picture. Clearly Brazil and Spain are the nations whose players have the highest mean overall score.

South America and Europe are more yellow than other continents. Asia and Africa are more on the blue side.

I subset the data straightaway, because many nations don’t have a considerable player count. From the map it is clear once again that South America and Europe tend to be on the higher side of the score. Interestingly, Spain has the highest mean potential, not Brazil.

Portugal naturally has the highest maximum potential of 94, as Cristiano Ronaldo already has an overall of 94. Spain's top potential is 92, even though it has the highest mean potential.

Once again I took a subset of the data, considering only countries that have at least 200 players listed in FIFA 2018. Players from Brazil and Chile appear to be almost level with their potential. The United Kingdom's huge difference is a shocker. Perhaps it is because of the younger players that it has.

Western countries appear to have the players most likely to improve. Countries in Asia and Africa, where most players already have a low Overall score, also have low Potential scores.

Once again Europe and South America are doing better than other continents. Let's see where the most valued player is from.

The most valued player is from Brazil.


From the plot above I see meaningful correlation between the following pairs:

  1. Potential and num_Value
  2. Overall and num_Value
  3. Overall and Potential
  4. Age and Overall
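Pairwise correlations such as the ones listed above could be computed with base R's `cor()`. The sketch below uses made-up numbers, so the coefficients it prints say nothing about the real dataset; `num_Value` is assumed to be the numeric version of Value, and `use = "pairwise.complete.obs"` is chosen here because Value contains missing entries.

```r
# Toy rows with made-up numbers; num_Value has a missing entry on purpose.
players <- data.frame(
  Age       = c(18, 22, 25, 29, 33),
  Overall   = c(60, 68, 75, 78, 74),
  Potential = c(80, 79, 80, 78, 74),
  num_Value = c(300, 1200, NA, 9000, 2500)
)

# Pearson correlation for every pair of columns, using only the rows
# where both columns of a given pair are present.
cor_matrix <- cor(players, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

Without the `use` argument, any pair involving a column with NAs would itself come back as NA.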


Normally one would expect the value of a player to rise with potential, and the correlation suggests that it does. However, there are players with potential above 90 and a value of only 975K. Maybe a player's preferred position has an impact on his value.

I also notice that the points are discrete. Let's add jitter to the plot to reduce overplotting.

It is easier to see the distribution of points now. I’ll do the same for the Overall variable.

It is the same story with Overall: value does increase with overall score, but there are some players with a high overall score and a low value. I wonder why they are undervalued. I will explore this further in the multivariate plots.

Once again it is easier to see the distribution with jitter.

The darkest points lie on the x = y line, meaning that these players have attained their potential. Interestingly, there are many points below this line.

A beautiful curve. Football is physically intensive, so one would expect a player to lose overall points with age. In the early stages, however, perhaps because of a lack of experience, a player gains points with age, and after the mid twenties the curve flattens. Somewhere around 33 the effect of age sets in and the overall points start to decrease. The curve confirms the intuition about the effect of age on a player's overall performance.

Bivariate Analysis

I have made the following observations:

  1. Of all the continents, South America and Europe are the best in Overall score, both in terms of mean overall score and max overall score.
  2. Europe and South America also have higher valued players than other continents. At 123 million euros, Brazil has the highest valued player.
  3. Interestingly, even with a high overall score, many players are undervalued. Maybe it is because of their preferred position or nationality. I will explore this point further.

Multivariate Plots Section


Looking at the plots above I see that defensive positions like LB and RB are valued less. Forward and striker positions are worth more. Here is a link to a description of the positions: https://en.wikipedia.org/wiki/Association_football_positions.



For the same score, a player who prefers a forward position is worth more than those who prefer a more central or defensive position.

A player's value does not seem to depend on his continent.


Players worth more have a higher Overall score and tend to be aged between 25 and 30.

Multivariate Analysis

The following relationships have been observed:

  1. A player's preferred position affects his worth. If he prefers a forward position, for the same overall score he is worth more than players who prefer midfield or defensive positions.
  2. A player's continent does not seem to affect his worth.





——

Final Plots and Summary

Plot One

Description One

I chose this plot because it clearly shows how the players in FIFA 2018 are distributed by country. Most of the players are from South America and Europe. The UK has the highest number of players featured.

Plot Two

Description Two

I expected a player to be worth less in the initial years of his career, to stabilize in his prime years, and eventually to lose worth because of his age. The curve reaffirms my intuition. It shows that for younger ages (16-22) a player's worth rises with age, it then becomes stable, and from around age 33 it starts a downward curve. Maybe a younger player gains popularity and thus increases his worth, or he improves his overall score and thus becomes more valuable. A good example is the UK, which has the youngest mean age and the largest mean gap between Potential and Overall among the large nations compared earlier.

Plot Three

Description Three


I am choosing the above plot as the final descriptive plot because it completes the story. While a player's overall score is a good indicator of his worth, his preferred position impacts his worth immensely. At the same overall score, forward players are more likely to be worth more than midfield or defensive players.


Reflection

Overall, I selected the important columns that would allow me to form insights about the characteristics of players featured in FIFA 2018.

Conclusion

Most players featured in FIFA 2018 are from South America and Europe. Most of them are clustered around 25 years of age, and the most common overall performance score is 66. Interestingly, most players are at their best, and younger players have a better chance of improving. A player's value is affected by his preferred position. These are the conclusions I have drawn from the exploratory analysis of the FIFA 2018 dataset.

Limitations

The dataset is limited, as it only contains data for FIFA 2018. I would have loved to explore how players' overall performance and value evolve over time. The wage column was mostly missing, so I could not form any meaningful insights from it.

Future Work

The analysis that I have performed can be extended to produce the best squad for a given budget. It could also be extended to address questions such as whether a 2-3-5 (pyramid) formation is better than a 4-2-4 formation for the squad, or whether the performance of a team would improve if the club invested in a new player.


References

  1. https://plot.ly/r/
  2. https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset
  3. https://en.wikipedia.org/wiki/Association_football_positions
  4. https://en.wikipedia.org/wiki/Formation_(association_football)#2%E2%80%933%E2%80%935_(Pyramid)